Reinforcement Learning by Phil Winder Ph.D

Author: Phil Winder Ph.D.
Language: English
Format: epub
Publisher: O'Reilly Media
Published: 2020-08-26T00:00:00+00:00


Tip

The performance of a robust policy should degrade gracefully in the presence of adversity. Policies that have uncharted regions near the optimal trajectory are not robust because the tiniest of deviations could lead to states with an undefined policy.

Next I implemented soft Q-learning (see “Soft Q-Learning (and Derivatives)”) and set the temperature parameter to 0.05. You can see the results in Figure 7-3. The optimal policies are unremarkable, even though they generally point in the right direction. But if you look closely, you can see that the action values and the direction arrows are smaller or nonexistent on the Q-learning side of the plot. Soft Q-learning has visited states other than those on the optimal trajectory. You can also see that the white colors representing larger action values are shifted toward the center of the images for soft Q-learning. The reason is explained by the candy store example: entropy promotes states with the greatest amount of choice.
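To make the mechanism concrete, here is a minimal tabular sketch of a soft Q-learning update, assuming a discrete state/action space; the function names and hyperparameter defaults are illustrative, not the book's code. The key difference from standard Q-learning is that the bootstrap target uses the soft (log-sum-exp) state value rather than the maximum, which is what rewards states that keep many good actions available.

```python
import numpy as np

def soft_value(q_row, temperature):
    """Soft state value: tau * log sum_a exp(Q(s, a) / tau).

    As temperature -> 0 this approaches max_a Q(s, a), recovering
    standard Q-learning; larger temperatures weight all actions,
    so states with many good options get higher values.
    """
    prefs = q_row / temperature
    m = prefs.max()  # subtract the max for numerical stability
    return temperature * (m + np.log(np.sum(np.exp(prefs - m))))

def soft_q_update(Q, s, a, r, s_next,
                  alpha=0.1, gamma=0.99, temperature=0.05):
    """One tabular soft Q-learning step toward the entropy-regularized target."""
    target = r + gamma * soft_value(Q[s_next], temperature)
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

def soft_policy(q_row, temperature):
    """Boltzmann (softmax) policy derived from the action values."""
    prefs = q_row / temperature
    prefs = prefs - prefs.max()  # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()
```

With a small temperature like 0.05 the policy stays close to greedy, but it still assigns nonzero probability to every action, which is why the agent explores states off the optimal trajectory.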


